PARQUET-446: Hide Thrift compiled headers and Boost from public API, #include scrubbing by wesm · Pull Request #49 · apache/parquet-cpp

wesm · 2016-02-12T23:21:44Z

This is the completion of work I started in PARQUET-442. This also resolves PARQUET-277 as no boost headers are included in the public API anymore.

I've done some scrubbing of #includes using Google's Clang-based include-what-you-use tool. PARQUET-522 can also be resolved when this is merged.


wesm · 2016-02-13T02:37:58Z

OK, I finished a bunch of include-what-you-use cleanup, so we can also resolve PARQUET-522 when this is merged.

Note that the wall-clock time for compiling libparquet.so is about 10% lower: I'm seeing 18.2s before and 16.3s after.

majetideepak · 2016-02-13T15:39:13Z

src/parquet/column/scanner.h


 protected:
-  size_t batch_size_;
+  int32_t batch_size_;


Can we make this int64_t? Why should we limit batch_size to 2GB worth of values for a byte-sized type?

Will change this to int64_t -- data page sizes are limited to int32_t, but a row group could in theory be larger.
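The sizing argument in this thread can be illustrated with a short sketch. The helper names `FitsInInt32` and `NarrowToPageCount` and the 3-billion-value row group are assumptions made up for the example; they are not part of parquet-cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not parquet-cpp API): true if a value count can be
// stored in a 32-bit signed batch size.
constexpr bool FitsInInt32(int64_t num_values) {
  return num_values <= std::numeric_limits<int32_t>::max();
}

// A 32-bit signed count tops out at 2^31 - 1, which is just under 2 GiB
// when each value occupies one byte.
static_assert(std::numeric_limits<int32_t>::max() == 2147483647, "2^31 - 1");

// Illustrative row-group size (an assumption for this example).
constexpr int64_t kLargeRowGroupValues = 3000000000LL;
static_assert(!FitsInInt32(kLargeRowGroupValues),
              "a row group this large needs a 64-bit batch size");

// Hypothetical narrowing helper for code that hands a count to a 32-bit
// field, such as a per-page value count.
inline int32_t NarrowToPageCount(int64_t num_values) {
  assert(num_values >= 0 && FitsInInt32(num_values));
  return static_cast<int32_t>(num_values);
}
```

With a 64-bit `batch_size_`, the scanner can express a batch spanning a whole row group, and only the code that deals with a single data page needs to narrow back to 32 bits.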

majetideepak · 2016-02-13T16:52:35Z

I am a little concerned about having our own types for DataPage etc. We now have a bunch of code to construct these data pages and other structures, and that code will not be useful for the write path. There we will again have to use the set methods of the parquet types to construct objects like parquet::DataPageHeader etc. Can we just use the parquet types to begin with? That way we will also be heading closer to the write path.

wesm · 2016-02-13T17:08:36Z

@majetideepak Well, we have a hard constraint, which is that exposing Thrift includes in the public API is not acceptable. What I've done is in line with parquet-mr's equivalent code, which I didn't actually look at until now: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV1.java

From a software layering point of view it is better (IMHO) to decouple serialization-deserialization matters as much as possible.

I'm happy of course to look at an alternate proposal that satisfies the constraint that developers using parquet-cpp as a 3rd-party library do not need to include Thrift libraries and headers in their build system.
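A minimal sketch of the layering described above, under stated assumptions: `generated::DataPageHeader` is a hand-written stand-in for a Thrift-generated struct, and `DataPage` and `MakeDataPage` are simplified, hypothetical versions of what a library could expose. None of this is the actual parquet-cpp code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a Thrift-generated struct (hypothetical). In a real build this
// would come from a generated header that only internal .cc files include.
namespace generated {
struct DataPageHeader {
  int32_t num_values;
  int32_t encoding;
};
}  // namespace generated

// Public-facing page type: plain C++ only, no serialization headers needed.
class DataPage {
 public:
  DataPage(std::vector<uint8_t> buffer, int32_t num_values, int32_t encoding)
      : buffer_(std::move(buffer)), num_values_(num_values), encoding_(encoding) {}

  const std::vector<uint8_t>& buffer() const { return buffer_; }
  int32_t num_values() const { return num_values_; }
  int32_t encoding() const { return encoding_; }

 private:
  std::vector<uint8_t> buffer_;
  int32_t num_values_;
  int32_t encoding_;
};

// Internal glue: the only place that sees both the generated struct and the
// public type. It copies the fields the reader needs out of the header.
inline std::unique_ptr<DataPage> MakeDataPage(
    const generated::DataPageHeader& header, std::vector<uint8_t> payload) {
  assert(header.num_values >= 0);
  return std::unique_ptr<DataPage>(
      new DataPage(std::move(payload), header.num_values, header.encoding));
}
```

Code that consumes `DataPage` compiles without any generated header on its include path; only the translation unit containing `MakeDataPage` would need it.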

wesm · 2016-02-13T17:09:17Z

@julienledem can you review and comment? Thank you

majetideepak · 2016-02-13T22:20:09Z

I see your point with respect to the serializer and deserializer and parquet-mr.
parquet-cpp should be able to use any serializer and deserializer, not just Thrift.

majetideepak · 2016-02-13T22:35:57Z

src/parquet/file/reader-internal.cc

+// assembled in a serialized stream for storing in a Parquet files
+
+SerializedPageReader::SerializedPageReader(std::unique_ptr<InputStream> stream,
+    Compression::type codec) :


Should we name this ThriftSerializedPageReader since it is specific to Thrift? If we want to move along the lines of parquet-mr, I believe there should be further decoupling between Thrift and this reader. We probably should have another module like parquet-cpp/parquet-thrift.

This could be done in a later patch, but I want to make sure we are on the right path. @julienledem, @nongli and others, please chime in with your experience from parquet-mr here.

Alternatively, we could make parquet-cpp similar to parquet-mr/parquet-thrift (if I understood it correctly) with an additional ability to read and write "batch size" columns.

I don't agree with this line of reasoning. The Parquet specification uses Thrift for its metadata serialization, which is separate from code that can marshal materialized/reassembled records to and from Thrift or some other RPC/Serialization format (which is the role of parquet-thrift, parquet-avro, and other adapter components in parquet-mr). See for example: https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftToParquetFileWriter.java

parquet-thrift in parquet-mr is not related to reading the Parquet metadata (which happens to use Thrift) in parquet-format.

We use Thrift in 2 places:

1. To read the Parquet metadata in the file. We want that one encapsulated and invisible to the users of the library. This is why there's a separate object model in addition to the Thrift IDL file. The Thrift IDL is here to make sure we stay backwards compatible and have a strict spec for the metadata. The other model is more convenient to work with and abstracts Thrift out.

2. parquet-thrift is the Thrift integration that lets users write Parquet files using a Thrift IDL to define their domain-specific schema. It depends on the core Parquet components, not the other way around. It is at the same level as parquet-protobuf, parquet-avro, parquet-pig ...

In particular, hiding the Thrift version used for the metadata ensures that it does not conflict with the version the end user uses for their own code generation.
I hope this helps.
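One common C++ way to keep a generated type invisible to library users is the pimpl idiom; the sketch below shows the idea and is not necessarily how this PR implements it. `FileMetadata`, its `Impl`, and the `generated::FileMetaData` stand-in are all hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// ---- What a public header might contain ----
class FileMetadata {
 public:
  explicit FileMetadata(int64_t num_rows);
  ~FileMetadata();  // defined below, where Impl is a complete type

  int64_t num_rows() const;

 private:
  struct Impl;  // forward declaration only: users never see its members
  std::unique_ptr<Impl> impl_;
};

// ---- What an internal .cc file might contain ----
// Stand-in for a Thrift-generated struct (hypothetical).
namespace generated {
struct FileMetaData {
  int64_t num_rows;
};
}  // namespace generated

// The generated type lives inside Impl, so its header (and the Thrift
// version behind it) never leaks into code that includes the public header.
struct FileMetadata::Impl {
  generated::FileMetaData metadata;
};

FileMetadata::FileMetadata(int64_t num_rows) : impl_(new Impl{{num_rows}}) {
  assert(num_rows >= 0);
}

FileMetadata::~FileMetadata() = default;

int64_t FileMetadata::num_rows() const { return impl_->metadata.num_rows; }
```

Because the public class only holds a pointer to an incomplete type, a user's own Thrift-generated code can be built against a different Thrift version without colliding with the library's headers.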

@wesm and @julienledem, thanks for the clarification!

majetideepak · 2016-02-14T16:18:20Z

+1 LGTM

julienledem · 2016-02-15T23:54:37Z

+1 LGTM

wesm · 2016-02-16T00:10:39Z

thank you!

wesm added 7 commits February 12, 2016 18:12

Remove Thrift compiled headers from public API and general use outside of deserialized-related internal headers and code paths. Add unit test to enforce this

b4b0412

Add more headers to parquet.h public API

9458b36

Remove serialized-page.* files, move serialized-page-test to parquet/file

07059ca

Remove any boost #include dependencies

2e39062

Remove outdated TODO

5be40d6

Some initial IWYU

6d4af8e

Finished IWYU path. Some imported impala code left unchanged for now

9e28fc3

wesm force-pushed the PARQUET-446 branch from 2701489 to 9e28fc3 Compare February 13, 2016 02:32

wesm changed the title ~~PARQUET-446: Hide Thrift and Boost compiled headers from public API~~ PARQUET-446: Hide Thrift and Boost compiled headers from public API, #include scrubbing Feb 13, 2016

wesm added 2 commits February 12, 2016 19:08

Refactor monolithic encodings/encodings.h

4c02d2b

Fix mixed-up include guard names

503b1c1

majetideepak reviewed Feb 13, 2016

Use int64_t for scanner batch sizes

e805a0c

wesm changed the title ~~PARQUET-446: Hide Thrift and Boost compiled headers from public API, #include scrubbing~~ PARQUET-446: Hide Thrift compiled headers and Boost from public API, #include scrubbing Feb 13, 2016

majetideepak reviewed Feb 13, 2016

This was referenced Feb 14, 2016

PARQUET-526: Add unit tests for Scanner. #50

Closed

PARQUET-515: Add "SetData" to LevelDecoder #51

Closed

asfgit closed this in b71e826 Feb 15, 2016

wesm deleted the PARQUET-446 branch February 16, 2016 00:10

3 participants